Abstract

We formulate the local ranking problem in the framework of bipartite ranking,
where the goal is to focus on the best instances. We propose a methodology based on
the construction of real-valued scoring functions. We study empirical risk minimization
of dedicated statistics which involve empirical quantiles of the scores. We first
state the problem of finding the best instances, which can be cast as a classification
problem with a mass constraint. Next, we develop special performance measures for the
local ranking problem which extend the Area Under an ROC Curve (AUC/AROC)
criterion and describe the optimal elements of these new criteria. We also highlight
the fact that the goal of ranking the best instances cannot be achieved in a stagewise
manner where, first, the best instances would be tentatively identified and then a
standard AUC criterion would be applied. Finally, we state preliminary statistical
results for the local ranking problem.
1 Introduction



The first takes all the glory, the second takes nothing. In applications where ranking is at
stake, people often focus on the best instances. When scanning the results from a query 
on a search engine, we rarely go beyond the one or two first pages on the screen. In 
the different context of credit risk screening, credit establishments elaborate scoring rules 
as reliability indicators and their main concern is to identify risky prospects especially 
among the top scores. In medical diagnosis, test scores indicate the odds for a patient 
to be healthy given a series of measurements (age, blood pressure, ...). Here again,
particular attention is paid to the "best" instances so as not to miss a possibly diseased
patient among the highest scores. These various situations can be formulated in the setup
of bipartite ranking where one observes i.i.d. copies of a random pair (X, Y) with X being 
an observation vector describing the instance (web page, debtor, patient) and Y a binary 
label assigning to one population or the other (relevant vs. non relevant, good vs. bad, 
healthy vs. diseased). In this problem, the goal is to rank the instances instead of simply 
classifying them. There is a growing literature on the ranking problem in the field of 
Machine Learning but most of it considers the Area under the ROC Curve (also known as 
the AUC or AROC) criterion as a measure of performance of the ranking rule [6], [13], [26], [1].
In a previous work, we have mentioned that the bipartite ranking problem under the AUC
criterion could be interpreted as a classification problem with pairs of observations [1].
But the limitation of this approach is that it weights uniformly the pairs of items which are
badly ranked. Therefore it cannot distinguish between ranking rules making
the same number of mistakes but in very different parts of the ROC curve. The AUC
is indeed a global criterion which does not allow one to concentrate on the "best" instances.
Special performance measures, such as the Discounted Cumulative Gain (DCG) criterion, 
have been introduced by practitioners in order to weight instances according to their 
rank [16] (see also [25], [7]) but providing theory for such criteria and developing empirical
risk minimization strategies is still a very open issue. In the present paper, we extend
the results of our previous work in [4] and set theoretical grounds for the problem of 
local ranking. The methodology we propose is based on the selection of a real-valued 
scoring function for which we formulate appropriate performance measures generalizing
the AUC criterion. We point out that ranking the best instances is an involved task as 
it is a two-fold problem: (i) find the best instances and (ii) provide a good ranking on 
these instances. The fact that these two goals cannot be considered independently will 
be highlighted in the paper. Despite this observation, we will first formulate the issue of 
finding the best instances which is to be understood as a toy problem for our purpose. This 
problem corresponds to a binary classification problem with a mass constraint (where 
the proportion $u_0$ of +1 labels predicted by the classifiers is fixed) and it might be
of interest per se. The main complication here has to do with the necessity of performing
quantile estimation, which affects the performance of statistical procedures. Our proof
technique was inspired by the earlier work of Koul [18] in the context of R-estimation
where similar statistics arise.

The rest of the paper is organized as follows. We first state the problem of finding 
the best instances and study the performance of empirical risk minimization in this setup 
(Section 2). We also explore the conditions on the distribution in order to recover fast
rates of convergence. In Section 3 we formulate performance measures for local ranking
and provide extensions of the AUC criterion. Finally (Section 4), we state some
preliminary statistical results on empirical risk minimization of these new criteria.

2 Finding the best instances 

In the present section, we have a limited goal which is only to determine the best instances
without concern for their order in the list. By considering this subproblem, we will
identify the main technical issues involved in the sequel. It also allows us to introduce the
main notations of the paper.

Just as in standard binary classification, we consider the pair of random variables
$(X, Y)$ where $X$ is an observation vector in a measurable space $\mathcal{X}$ and $Y$ is a binary label
in $\{-1, +1\}$. The distribution of $(X, Y)$ can be described by the pair $(\mu, \eta)$ where $\mu$ is
the marginal distribution of $X$ and $\eta$ is the a posteriori distribution defined by
$\eta(x) = \mathbb{P}\{Y = 1 \mid X = x\}$, $\forall x \in \mathcal{X}$. We define the rate of best instances as the proportion of best
instances to be considered and denote it by $u_0 \in (0, 1)$. We denote by $Q(\eta, 1 - u_0)$ the
$(1 - u_0)$-quantile of the random variable $\eta(X)$. Then the set of best instances at rate $u_0$
is given by:

$$C^*_{u_0} = \{x \in \mathcal{X} \mid \eta(x) > Q(\eta, 1 - u_0)\}.$$

We mention two trivial properties of the set $C^*_{u_0}$ which will be important in the sequel:

• Mass constraint: we have $\mu(C^*_{u_0}) = \mathbb{P}\{X \in C^*_{u_0}\} = u_0$,

• Invariance property: as a functional of $\eta$, the set $C^*_{u_0}$ is invariant by strictly
increasing transforms of $\eta$.

The problem of finding a proportion $u_0$ of the best instances boils down to the
estimation of the unknown set $C^*_{u_0}$ on the basis of empirical data. Before turning to the
statistical analysis of the problem, we first relate it to binary classification.

2.1 A classification problem with a mass constraint 

A classifier is a measurable function $g : \mathcal{X} \to \{-1, +1\}$ and its performance is measured
by the classification error $L(g) = \mathbb{P}\{Y \neq g(X)\}$. Let $u_0 \in (0, 1)$ be fixed. Denote by
$g^*_{u_0} = 2\mathbb{I}_{C^*_{u_0}} - 1$ the classifier predicting $+1$ on the set of best instances $C^*_{u_0}$ and $-1$ on its
complement. The next proposition shows that $g^*_{u_0}$ is an optimal element for the problem
of minimization of $L(g)$ over the family of classifiers $g$ satisfying the mass constraint

$$\mathbb{P}\{g(X) = 1\} = u_0.$$

Proposition 1 For any classifier $g : \mathcal{X} \to \{-1, +1\}$ such that $g(x) = 2\mathbb{I}_C(x) - 1$ for
some subset $C$ of $\mathcal{X}$ and $\mu(C) = \mathbb{P}\{g(X) = 1\} = u_0$, we have

$$L^*_{u_0} := L(g^*_{u_0}) \le L(g).$$

Furthermore, we have

$$L^*_{u_0} = 1 - Q(\eta, 1 - u_0) + (1 - u_0)\,(2Q(\eta, 1 - u_0) - 1) - \mathbb{E}\left(|\eta(X) - Q(\eta, 1 - u_0)|\right),$$

and

$$L(g) - L(g^*_{u_0}) = 2\,\mathbb{E}\left(|\eta(X) - Q(\eta, 1 - u_0)|\;\mathbb{I}_{C^*_{u_0} \Delta C}(X)\right),$$

where $\Delta$ denotes the symmetric difference operation between two subsets of $\mathcal{X}$.

PROOF. For simplicity, we temporarily change the notation and set $q = Q(\eta, 1 - u_0)$.
Then, for any classifier $g$ satisfying the constraint $\mathbb{P}\{g(X) = 1\} = u_0$, we have

$$L(g) = \mathbb{E}\left((\eta(X) - q)\,\mathbb{I}_{[g(X) = -1]} + (q - \eta(X))\,\mathbb{I}_{[g(X) = +1]}\right) + (1 - u_0)\,q + (1 - q)\,u_0.$$

The statements of the proposition immediately follow. □
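As an illustration only, the proposition can be checked by brute force on a small discrete model. The sketch below assumes a hypothetical toy setting that is not in the paper: six equally likely points with known values of the regression function, and rate 1/2, so that every admissible set contains exactly three points. The helper names `quantile`, `error` and `excess_formula` are ours.

```python
from itertools import combinations

def quantile(values, v):
    """Generalized inverse inf{t : F(t) >= v} for a uniform draw from `values`."""
    ordered = sorted(values)
    n = len(ordered)
    for k, t in enumerate(ordered, start=1):
        if k / n >= v:
            return t
    return ordered[-1]

def error(eta, C):
    """L(g) for g = 2*1_C - 1 when X is uniform on the indices of `eta`."""
    return sum(1 - e if x in C else e for x, e in enumerate(eta)) / len(eta)

# Hypothetical toy model: six equally likely points with known eta(x).
eta = [0.9, 0.2, 0.65, 0.4, 0.8, 0.1]
u0 = 0.5
q = quantile(eta, 1 - u0)
C_star = frozenset(x for x, e in enumerate(eta) if e > q)

# Every set C with mass u0, i.e. exactly three points here.
candidates = [frozenset(c) for c in combinations(range(len(eta)), 3)]
best = min(candidates, key=lambda C: error(eta, C))

def excess_formula(C):
    """2 * E(|eta(X) - q| * 1{X in the symmetric difference of C_star and C})."""
    return 2 * sum(abs(eta[x] - q) for x in C_star ^ C) / len(eta)

for C in candidates:
    gap = error(eta, C) - error(eta, C_star)
    assert abs(gap - excess_formula(C)) < 1e-12

print(sorted(best), error(eta, C_star))
```

On this toy model the minimizer over all sets of mass 1/2 is the set of the three points with the largest regression values, and the excess risk of every competitor matches the symmetric-difference formula.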

There has been recent progress in the field of classification theory where the aim is to
introduce constraints in the classification procedure or to adapt it to other problems. We
relate our formulation to other approaches in the following remarks.

Remark 1 (Connection to hypothesis testing) The implicit asymmetry in the
problem due to the emphasis on the best instances is reminiscent of the statistical theory
of hypothesis testing. We can formulate a test of simple hypothesis by taking the null
assumption to be $H_0 : Y = -1$ and the alternative assumption to be $H_1 : Y = +1$.
We want to decide which hypothesis is true given the observation $X$. Each classifier $g$
provides a test statistic $g(X)$. The performance of the test is then described by its type I
error $\alpha(g) = \mathbb{P}\{g(X) = 1 \mid Y = -1\}$ and its power $\beta(g) = \mathbb{P}\{g(X) = 1 \mid Y = +1\}$. We point
out that if the classifier $g$ satisfies a mass constraint, then we can relate the classification
error to the type I error of the test defined by $g$ through the relation:

$$L(g) = 2(1 - p)\,\alpha(g) + p - u_0,$$

where $p = \mathbb{P}\{Y = 1\}$. Indeed, $L(g) = p\,(1 - \beta(g)) + (1 - p)\,\alpha(g)$ and the mass constraint reads
$u_0 = p\,\beta(g) + (1 - p)\,\alpha(g)$. Similarly, we have: $L(g) = 2p\,(1 - \beta(g)) - p + u_0$. Therefore, the
optimal classifier minimizes the type I error (maximizes the power) among all classifiers
with the same mass constraint. In some applications, it is more relevant to fix a constraint
on the probability of a false alarm (type I error) and maximize the power. This question
is explored in a recent paper by Scott [27] (see also [29]).
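The two relations between the classification error, the type I error and the power can be checked numerically. The following sketch assumes a hypothetical discrete model (six equally likely points with known regression values) and an arbitrary set $C$ of mass 1/2; it uses the mass constraint in the form $u_0 = p\beta(g) + (1-p)\alpha(g)$, which gives $L(g) = 2(1-p)\alpha(g) + p - u_0 = 2p(1-\beta(g)) - p + u_0$. The function name `rates` is ours.

```python
def rates(eta, C):
    """Return (p, u0, alpha, beta, L) for g = 2*1_C - 1, X uniform on indices."""
    n = len(eta)
    p = sum(eta) / n                                   # P{Y = +1}
    u0 = len(C) / n                                    # P{g(X) = +1}
    alpha = sum(1 - eta[x] for x in C) / n / (1 - p)   # P{g(X) = 1 | Y = -1}
    beta = sum(eta[x] for x in C) / n / p              # P{g(X) = 1 | Y = +1}
    L = sum(1 - e if x in C else e for x, e in enumerate(eta)) / n
    return p, u0, alpha, beta, L

eta = [0.9, 0.2, 0.65, 0.4, 0.8, 0.1]   # hypothetical toy values of eta(x)
p, u0, alpha, beta, L = rates(eta, {0, 2, 3})

via_alpha = 2 * (1 - p) * alpha + p - u0    # error expressed through the type I error
via_beta = 2 * p * (1 - beta) - p + u0      # error expressed through the power
print(L, via_alpha, via_beta)
```

All three printed values coincide up to floating-point round-off, whatever set $C$ is chosen.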

Remark 2 (Connection with regression level set estimation) We mention that
the estimation of the level sets of the regression function has been studied in the statistics
literature [3] (see also [32], [38]) as well as in the learning literature, for instance in the
context of anomaly detection ([31], [28], [37]). In our framework of classification with mass
constraint, the threshold defining the level set involves the quantile of the random variable
$\eta(X)$.

Remark 3 (Connection with the minimum volume set approach) Although the
point of view adopted in this paper is very different, the problem described above may be
formulated in the framework of minimum volume set learning as considered in [30]. As
a matter of fact, the set $C^*_{u_0}$ may be viewed as the solution of the constrained optimization
problem:

$$\min_{C}\; \mathbb{P}\{X \in C \mid Y = -1\}$$

over the class of measurable sets $C$, subject to

$$\mathbb{P}\{X \in C\} \ge u_0.$$

The main difference in our case comes from the fact that the constraint on the volume set
has to be estimated using the data while in [30] it is computed from a known reference
measure. We believe that learning methods for minimum volume set estimation may
be extended to our setting. A natural way to do it would consist in replacing the
conditional distribution of $X$ given $Y = -1$ by its empirical counterpart. This is beyond
the scope of the present paper but will be the subject of future investigation.

2.2 Empirical risk minimization 

We now investigate the estimation of the set $C^*_{u_0}$ of best instances at rate $u_0$ based on
training data. Suppose that we are given $n$ i.i.d. copies $(X_1, Y_1), \dots, (X_n, Y_n)$ of the
pair $(X, Y)$. Since we have the ranking problem in mind, our methodology will consist in
building the candidate sets from a class $\mathcal{S}$ of real-valued scoring functions $s : \mathcal{X} \to \mathbb{R}$.
Indeed, we consider sets of the form

$$C_s = C_{s,u_0} = \{x \in \mathcal{X} \mid s(x) > Q(s, 1 - u_0)\},$$

where $s$ is an element of $\mathcal{S}$ and $Q(s, 1 - u_0)$ is the $(1 - u_0)$-quantile of the random variable
$s(X)$. Note that such sets satisfy the same properties as $C^*_{u_0}$ with respect to the mass constraint
and invariance to strictly increasing transforms of $s$.






From now on, we will use the simplified notation:

$$L(s) = L(s, u_0) = L(C_s) = \mathbb{P}\{Y \cdot (s(X) - Q(s, 1 - u_0)) < 0\}.$$

A scoring function minimizing the quantity

$$L_n(s) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{Y_i \cdot (s(X_i) - Q(s, 1 - u_0)) < 0\}$$

is expected to approximately minimize the true error $L(s)$, but the quantile depends
on the unknown distribution of $X$. In practice, one has to replace $Q(s, 1 - u_0)$ by its
empirical counterpart $\hat{Q}(s, 1 - u_0)$, which corresponds to the empirical quantile. We will
thus consider, instead of $L_n(s)$, the truly empirical error:

$$\hat{L}_n(s) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{Y_i \cdot (s(X_i) - \hat{Q}(s, 1 - u_0)) < 0\}.$$

Note that $\hat{L}_n(s)$ is a complicated statistic since the empirical quantile involves all the
instances $X_1, \dots, X_n$. We also mention that $\hat{L}_n(s)$ is a biased estimate of the classification
error $L(s)$ of the classifier $g_s(x) = 2\,\mathbb{I}\{s(x) > Q(s, 1 - u_0)\} - 1$.

We introduce some more notation. Set, for all $t \in \mathbb{R}$:

• $F_s(t) = \mathbb{P}\{s(X) \le t\}$

• $G_s(t) = \mathbb{P}\{s(X) \le t \mid Y = +1\}$

• $H_s(t) = \mathbb{P}\{s(X) \le t \mid Y = -1\}$

to be the cumulative distribution functions (cdf) of $s(X)$ (respectively, given $Y = 1$, given
$Y = -1$). We recall that the definition of the quantiles of (the distribution of) a random
variable involves the notion of generalized inverse $F^{-1}$ of a function $F$:

$$F^{-1}(z) = \inf\{t \in \mathbb{R} \mid F(t) \ge z\}.$$

Thus, we have, for all $v \in (0, 1)$:

$$Q(s, v) = F_s^{-1}(v) \quad \text{and} \quad \hat{Q}(s, v) = \hat{F}_s^{-1}(v),$$

where $\hat{F}_s$ is the empirical cdf of $s(X)$: $\hat{F}_s(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{s(X_i) \le t\}$, $\forall t \in \mathbb{R}$.
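To make the definitions concrete, here is a minimal sketch of the empirical quantile, computed as the generalized inverse of the empirical cdf, and of the truly empirical error built on it. The sample values and the function names `empirical_quantile` and `empirical_risk` are illustrative assumptions, not part of the paper.

```python
import math

def empirical_quantile(scores, v):
    """Generalized inverse of the empirical cdf: inf{t : F_n(t) >= v}."""
    if v <= 0:
        return -math.inf
    ordered = sorted(scores)
    n = len(ordered)
    # Smallest k with k / n >= v; the small offset guards against round-off.
    k = max(1, min(n, math.ceil(n * v - 1e-12)))
    return ordered[k - 1]

def empirical_risk(scores, labels, u0):
    """Fraction of i with Y_i * (s(X_i) - Q_hat(s, 1 - u0)) < 0."""
    q_hat = empirical_quantile(scores, 1 - u0)
    return sum(y * (s - q_hat) < 0 for s, y in zip(scores, labels)) / len(scores)

# Hypothetical sample of scores s(X_i) and labels Y_i.
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6]
labels = [-1, -1, 1, 1, -1, -1, 1, 1]

q_hat = empirical_quantile(scores, 0.75)
risk = empirical_risk(scores, labels, 0.25)
print(q_hat, risk)
```

Because the threshold is recomputed from the whole sample for each scoring function, the terms of the sum are not independent, which is the source of the technical difficulty discussed in the text.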

Without loss of generality, we will assume that all scoring functions in $\mathcal{S}$ take their
values in $(0, A)$ for some $A > 0$. We now turn to study the performance of minimizers of
$\hat{L}_n(s)$ over a class $\mathcal{S}$ of scoring functions defined by

$$\hat{s}_n = \arg\min_{s \in \mathcal{S}} \hat{L}_n(s).$$
Our first main result is an excess risk bound for the empirical risk minimizer $\hat{s}_n$ over
a class $\mathcal{S}$ of uniformly bounded scoring functions. In the following theorem, we assume
that the level sets of scoring functions from the class $\mathcal{S}$ form a Vapnik-Chervonenkis (VC)
class of sets.

Theorem 2 We assume that

(i) the class $\mathcal{S}$ is symmetric (i.e. if $s \in \mathcal{S}$ then $A - s \in \mathcal{S}$) and is a VC major class
of functions with VC dimension $V$.

(ii) the family $\mathcal{K} = \{G_s, H_s : s \in \mathcal{S}\}$ of cdfs satisfies the following property: any
$K \in \mathcal{K}$ has left and right derivatives, denoted by $K'_-$ and $K'_+$, and there exist
strictly positive constants $b$, $B$ such that $\forall (K, t) \in \mathcal{K} \times (0, A)$,

$$b \le |K'_+(t)| \le B \quad \text{and} \quad b \le |K'_-(t)| \le B.$$

For any $\delta > 0$, we have, with probability larger than $1 - \delta$,

$$L(\hat{s}_n) - \inf_{s \in \mathcal{S}} L(s) \le c_1 \sqrt{\frac{V}{n}} + c_2 \sqrt{\frac{\ln(1/\delta)}{n}},$$

for some positive constants $c_1$, $c_2$.

We now provide some insights on conditions (i) and (ii) of the theorem. 

Remark 4 (on the complexity assumption) On the terminology of major sets and
major classes, we refer to Dudley [10]. In the proof, we need to control empirical processes
indexed by sets of the form {x : s(x) > t} or {x : s(x) < t}. Condition (i) guarantees 
that these sets form a VC class of sets. 

Remark 5 (on the choice of the class $\mathcal{S}$ of scoring functions) In order to grasp
the meaning of condition (ii) of the theorem, we consider the one-dimensional case with
real-valued scoring functions. Assume that the distribution of the random variable $X$
has a bounded density $f$ with respect to Lebesgue measure. Assume also that scoring
functions $s$ are differentiable except, possibly, at a finite number of points, and derivatives
are denoted by $s'$. Denote by $f_s$ the density of $s(X)$. Let $t \in (0, A)$ and denote by $x_1, \dots, x_p$
the real roots of the equation $s(x) = t$. We can express the density of $s(X)$ thanks to
the change-of-variable formula (see e.g. [24]):

$$f_s(t) = \frac{f(x_1)}{|s'(x_1)|} + \dots + \frac{f(x_p)}{|s'(x_p)|}.$$

This shows that the scoring functions should present neither flat nor steep parts. We
can take, for instance, the class $\mathcal{S}$ to be the class of piecewise linear functions with a finite
number of local extrema and with uniformly bounded left and right derivatives: $\forall s \in \mathcal{S}$,
$\forall x$, $m \le s'_+(x) \le M$ and $m \le s'_-(x) \le M$ for some strictly positive constants $m$ and $M$
(see Figure 1). Note that any subinterval of $[0, A]$ has to be in the range of scoring functions
$s$ (if not, some elements of $\mathcal{K}$ will present a plateau). In fact, the proof requires such a
behavior only in the vicinity of the points corresponding to the quantiles $Q(s, 1 - u_0)$ for
all $s \in \mathcal{S}$.

[Figure 1: Scoring function]

PROOF. Set $v_0 = 1 - u_0$. By a standard argument (see e.g. [8]), we have:

$$L(\hat{s}_n) - \inf_{s \in \mathcal{S}} L(s) \le 2 \sup_{s \in \mathcal{S}} |\hat{L}_n(s) - L(s)| \le 2 \sup_{s \in \mathcal{S}} |\hat{L}_n(s) - L_n(s)| + 2 \sup_{s \in \mathcal{S}} |L_n(s) - L(s)|.$$

Note that the second term in the bound is an empirical process whose behavior is
well-known. In our case, assumption (i) implies that the class of sets $\{x : s(x) > Q(s, v_0)\}$
indexed by scoring functions $s$ has a VC dimension smaller than $V$. Hence, we have by a
concentration argument combined with a VC bound for the expectation of the supremum
(see, e.g. [20]), for any $\delta > 0$, with probability larger than $1 - \delta$,

$$\sup_{s \in \mathcal{S}} |L_n(s) - L(s)| \le c \sqrt{\frac{V}{n}} + c' \sqrt{\frac{\ln(1/\delta)}{n}}$$

for universal constants $c$, $c'$.

We now show how to handle the first term. Following the work of Koul [18], we set
the following notations:

$$M(s, v) = \mathbb{P}\{Y \cdot (s(X) - Q(s, v)) < 0\},$$

$$U_n(s, v) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{Y_i \cdot (s(X_i) - Q(s, v)) < 0\} - M(s, v),$$

and note that $U_n(s, v)$ is centered.

We then have the following decomposition, for any $s \in \mathcal{S}$ and $v_0 \in (0, 1)$:

$$|\hat{L}_n(s) - L_n(s)| \le |U_n(s, F_s \circ \hat{F}_s^{-1}(v_0)) - U_n(s, v_0)| + |M(s, F_s \circ \hat{F}_s^{-1}(v_0)) - M(s, v_0)|.$$

Note that $M(s, F_s \circ \hat{F}_s^{-1}(v_0)) = \mathbb{P}\{Y \cdot (s(X) - \hat{Q}(s, v_0)) < 0 \mid D_n\}$ where $D_n$ denotes the
sample $(X_1, Y_1), \dots, (X_n, Y_n)$.

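Recall the notation $p = \mathbb{P}\{Y = 1\}$. Since $M(s, v) = (1 - p)(1 - H_s \circ F_s^{-1}(v)) + p\,G_s \circ F_s^{-1}(v)$
and $F_s = pG_s + (1 - p)H_s$, the mapping $v \mapsto M(s, v)$ is Lipschitz by assumption (ii). Thus,
there exists a constant $\kappa < \infty$, depending only on $p$, $b$ and $B$, such that:

$$|M(s, F_s \circ \hat{F}_s^{-1}(v_0)) - M(s, v_0)| \le \kappa\, |F_s \circ \hat{F}_s^{-1}(v_0) - v_0|.$$

Moreover, we have, for any $s \in \mathcal{S}$:

$$|F_s \circ \hat{F}_s^{-1}(v_0) - v_0| \le |F_s \circ \hat{F}_s^{-1}(v_0) - \hat{F}_s \circ \hat{F}_s^{-1}(v_0)| + |\hat{F}_s \circ \hat{F}_s^{-1}(v_0) - v_0| \le \sup_{t \in (0, A)} |F_s(t) - \hat{F}_s(t)| + \frac{1}{n}.$$

Here again, we can use assumption (i) and a classical VC bound from [20] in order to
control the empirical process, with probability larger than $1 - \delta$:

$$\sup_{(s, t) \in \mathcal{S} \times (0, A)} |F_s(t) - \hat{F}_s(t)| \le c \sqrt{\frac{V}{n}} + c' \sqrt{\frac{\ln(1/\delta)}{n}}$$

for some constants $c$, $c'$.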

It remains to control the term involving the process $U_n$:

$$|U_n(s, F_s \circ \hat{F}_s^{-1}(v_0)) - U_n(s, v_0)| \le \sup_{v \in (0, 1)} |U_n(s, v) - U_n(s, v_0)| \le 2 \sup_{v \in (0, 1)} |U_n(s, v)|.$$

Using that the class of sets of the form $\{x : s(x) > Q(s, v)\}$ for $v \in (0, 1)$ is included in
the class of sets of the form $\{x : s(x) > t\}$ where $t \in (0, A)$, we then have

$$\sup_{v \in (0, 1)} |U_n(s, v)| \le \sup_{t \in (0, A)} \left| \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{Y_i \cdot (s(X_i) - t) < 0\} - \mathbb{P}\{Y \cdot (s(X) - t) < 0\} \right|,$$

which leads again to an empirical process indexed by a VC class of sets and can be bounded
as before. □
2.3 Fast rates of convergence

We now propose to examine conditions leading to fast rates of convergence (faster than
$n^{-1/2}$). It has been noticed (see [21], [33], [23]) that it is possible to derive such rates of
convergence in the classification setup under additional assumptions on the distribution. 
We propose here to adapt these assumptions for the problem of classification with mass 
constraint. 

Our concern here is to formulate the type of conditions which render the problem 
easier from a statistical perspective. For this reason and to avoid technical issues, we will 
consider a quite restrictive setup where it is assumed that: 

1. the class S of scoring functions is a finite class with N elements, 

2. an optimal scoring rule s* is contained in S. 

We have found that the following additional conditions on the distribution and the
class $\mathcal{S}$ make it possible to derive fast rates of convergence for the excess risk in our problem.

(iii) There exist constants $\alpha \in (0, 1)$ and $B > 0$ such that, for all $t \ge 0$,

$$\mathbb{P}\{|\eta(X) - Q(\eta, 1 - u_0)| \le t\} \le B\, t^{\frac{\alpha}{1 - \alpha}}.$$

(iv) the family $\mathcal{K} = \{G_s, H_s : s \in \mathcal{S}\}$ of cdfs satisfies the following property: for any
$s \in \mathcal{S}$, $G_s$ and $H_s$ are twice differentiable at $Q(s, 1 - u_0) = F_s^{-1}(1 - u_0)$.

We point out that conditions (ii) and (iii) are not completely independent. Indeed,
if $(G_\eta, H_\eta)$ belongs to a class $\mathcal{K}$ fulfilling condition (ii), then $F_\eta = pG_\eta + (1 - p)H_\eta$ is
Lipschitz and condition (iii) is satisfied with $\alpha = 1/2$. Note that condition (iii) simply
extends the standard low noise assumption introduced by Tsybakov [33] (see also [2] for an
account on this) where the level $1/2$ is replaced by the $(1 - u_0)$-quantile of $\eta(X)$. Indeed,
we have, under condition (iii), the variance control, for any $s \in \mathcal{S}$:

$$\mathrm{Var}\left(\mathbb{I}\{Y \neq 2\mathbb{I}_{C_s}(X) - 1\} - \mathbb{I}\{Y \neq 2\mathbb{I}_{C^*_{u_0}}(X) - 1\}\right) \le c\,(L(s) - L^*_{u_0})^{\alpha},$$

or, equivalently,

$$\mathbb{E}\left(\mathbb{I}_{C_s \Delta C^*_{u_0}}(X)\right) \le c\,(L(s) - L^*_{u_0})^{\alpha}.$$

Now, if we denote

$$s_n = \arg\min_{s \in \mathcal{S}} L_n(s),$$

we have, by a standard argument based on Bernstein's inequality (see Section 5.2 in [2]),
with probability $1 - \delta$,

$$L(s_n) - L^*_{u_0} \le c \left(\frac{\log(N/\delta)}{n}\right)^{\frac{1}{2 - \alpha}}$$

for some positive constant $c$.

The novel part of the analysis below lies in the control of the bias induced by plugging
the empirical quantile $\hat{Q}(s, 1 - u_0)$ in the risk functional. The next theorem shows that
faster rates of convergence can be obtained under the previous assumptions with $\alpha = 1/2$.

Theorem 3 We assume that the class $\mathcal{S}$ of scoring functions is a finite class with
$N$ elements, and that it contains an optimal scoring rule $s^*$. Moreover, we assume
that conditions (i)-(iv) are satisfied. Then, for any $\delta > 0$, we have, with probability
$1 - \delta$:

$$L(\hat{s}_n) - L^*_{u_0} \le c \left(\frac{\log(N/\delta)}{n}\right)^{2/3},$$

for some constant $c$.

Remark 6 (on the rate $n^{-2/3}$) The previous result highlights the fact that rates
faster than the one obtained in Theorem 2 can be obtained in this setup with additional
regularity assumptions. However, it is noteworthy that the standard low noise assumption
(iii) is already contained in assumption (ii), which is required in proving the typical $n^{-1/2}$
rate. The consequence of this observation is that there is no hope of getting rates up to $n^{-1}$
unless assumption (ii) is weakened.

The proof of the previous theorem is based on two arguments: the structure of linear
signed rank statistics and the variance control assumption. The situation is similar to
the one we encountered in [5] where we were dealing with U-statistics and we had to invoke
Hoeffding's decomposition in order to grasp the behavior of the underlying U-processes.
Here we require a similar argument to describe the structure of the empirical risk functional
$\hat{L}_n(s)$ under study. This statistic can be interpreted as a linear signed rank statistic and
the key decomposition has been used in the context of nonparametric hypothesis testing
and R-estimation. We mainly refer to Hajek and Sidak [H], Dupac and Hajek [11], Koul
[17], Koul and Staudte [19] for an account on rank statistics.

We briefly go through the main ideas, but first we need to introduce some notation.
Set:

$$\forall v \in [0, 1], \quad K(s, v) = \mathbb{E}\left(Y\, \mathbb{I}\{s(X) \le Q(s, v)\}\right) = p\,G_s(Q(s, v)) - (1 - p)\,H_s(Q(s, v)),$$

$$\hat{K}_n(s, v) = \frac{1}{n} \sum_{i=1}^{n} Y_i\, \mathbb{I}\{s(X_i) \le \hat{Q}(s, v)\}.$$

Then we can write:

$$L(s) = 1 - p + K(s, 1 - u_0),$$

$$\hat{L}_n(s) = \frac{n_-}{n} + \hat{K}_n(s, 1 - u_0),$$

where $n_- = \sum_{i=1}^{n} \mathbb{I}\{Y_i = -1\}$.
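The decomposition of the empirical risk into the proportion of negative labels plus a signed sum over the instances below the empirical quantile can be verified on a sample. The sketch below is illustrative only: the data are hypothetical, the helper names are ours, and the empirical risk is computed as the error of the classifier predicting +1 exactly when the score exceeds the empirical quantile, so that a positive instance sitting exactly at the threshold counts as an error.

```python
import math

def empirical_quantile(scores, v):
    """Generalized inverse of the empirical cdf: inf{t : F_n(t) >= v}."""
    if v <= 0:
        return -math.inf
    ordered = sorted(scores)
    n = len(ordered)
    k = max(1, min(n, math.ceil(n * v - 1e-12)))  # offset guards against round-off
    return ordered[k - 1]

def k_hat(scores, labels, v):
    """(1/n) * sum_i Y_i * 1{s(X_i) <= Q_hat(s, v)}."""
    q_hat = empirical_quantile(scores, v)
    return sum(y for s, y in zip(scores, labels) if s <= q_hat) / len(scores)

def thresholded_risk(scores, labels, u0):
    """Empirical error of x -> 2 * 1{s(x) > Q_hat(s, 1 - u0)} - 1."""
    q_hat = empirical_quantile(scores, 1 - u0)
    predictions = [1 if s > q_hat else -1 for s in scores]
    return sum(g != y for g, y in zip(predictions, labels)) / len(scores)

# Hypothetical sample of scores s(X_i) and labels Y_i.
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6]
labels = [-1, -1, 1, 1, -1, -1, 1, 1]

n_minus = labels.count(-1)
lhs = thresholded_risk(scores, labels, 0.25)
rhs = n_minus / len(labels) + k_hat(scores, labels, 0.75)
print(lhs, rhs)
```

Both sides equal 0.25 on this sample; the identity only rewrites the count of misclassified negatives as the number of negatives minus those falling below the threshold.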

We note that the statistic $\hat{L}_n(s)$ is related to linear signed rank statistics.

Definition 4 (Linear signed rank statistic) Consider $Z_1, \dots, Z_n$ an i.i.d. sample
with distribution $F$ and a real-valued score generating function $\Phi$. Denote by $R_i^+ = \mathrm{rank}(|Z_i|)$
the rank of $|Z_i|$ in the sample $|Z_1|, \dots, |Z_n|$. Then the statistic

$$\sum_{i=1}^{n} \Phi\left(\frac{R_i^+}{n + 1}\right) \mathrm{sgn}(Z_i)$$

is a linear signed rank statistic.

Proposition 5 For fixed $s$ and $v$, the statistic $\hat{K}_n(s, v)$ is a linear signed rank statistic.

PROOF. Take $Z_i = Y_i\, s(X_i)$. The random variables $Z_i$ have their absolute value distributed
according to $F_s$ and have the same sign as $Y_i$. It is easy to see that the statistic $\hat{K}_n(s, v)$
is a linear signed rank statistic with score generating function $\Phi(x) = \mathbb{I}_{[x \le v]}$. □

A decomposition of Hoeffding's type for such statistics can be formulated. Set first:

$$Z_n(s, v) = \frac{1}{n} \sum_{i=1}^{n} \left(Y_i - K'(s, v)\right) \mathbb{I}\{s(X_i) \le Q(s, v)\} - K(s, v) + v\,K'(s, v),$$

where $K'(s, v)$ denotes the derivative of the function $v \mapsto K(s, v)$. Note that $Z_n(s, v)$ is a
centered random variable with variance $\sigma^2(s, v)/n$, where

$$\sigma^2(s, v) = v - K(s, v)^2 + v(1 - v)\,K'(s, v)^2 - 2(1 - v)\,K'(s, v)\,K(s, v).$$

The next result is due to Koul [17] and we provide an alternate proof in the Appendix.

Proposition 6 We have, for all $s \in \mathcal{S}$ and $v \in [0, 1]$:

$$\hat{K}_n(s, v) = K(s, v) + Z_n(s, v) + \Lambda_n(s),$$

with

$$\Lambda_n(s) = O_{\mathbb{P}}(n^{-1}) \quad \text{as } n \to \infty.$$
This asymptotic expansion highlights the structure of the statistic $\hat{L}_n(s)$ for fixed $s$.
The leading term $Z_n(s, 1 - u_0)$ is an empirical average of i.i.d. random variables and it
provides the asymptotic variance of $\hat{L}_n(s)$. It is worth noticing that $Z_n(s, 1 - u_0)$ does not
reduce to the same empirical functional with the true, instead of the empirical, quantile:
it also involves a derivative term. Since the remainder term $\Lambda_n(s)$ is of the order $n^{-1}$,
it will not affect the final rate of convergence under low noise conditions. Therefore, the
variance control assumption concerns the variance of the function involved in the empirical
average $Z_n(s, 1 - u_0)$.

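We denote by:

$$h_s(X_i, Y_i) = \left(Y_i - K'(s, v)\right) \mathbb{I}\{s(X_i) \le Q(s, v)\} - K(s, v) + v\,K'(s, v),$$

and we then have

$$Z_n(s, v) - Z_n(s^*, v) = \frac{1}{n} \sum_{i=1}^{n} \left(h_s(X_i, Y_i) - h_{s^*}(X_i, Y_i)\right).$$

Proposition 7 Fix $v \in [0, 1]$. Assume that condition (iii) holds. Then, we have, for
all $s \in \mathcal{S}$:

$$\mathrm{Var}\left(h_s(X_i, Y_i) - h_{s^*}(X_i, Y_i)\right) \le c\,(L(s) - L(s^*))^{\alpha},$$

for some constant $c$.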

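PROOF. We first write that:

$$h_s(X_i, Y_i) - h_{s^*}(X_i, Y_i) = \mathrm{I} + \mathrm{II} + \mathrm{III} + \mathrm{IV} + \mathrm{V},$$

where

$\mathrm{I} = Y_i \left(\mathbb{I}\{s(X_i) \le Q(s, v)\} - \mathbb{I}\{s^*(X_i) \le Q(s^*, v)\}\right)$

$\mathrm{II} = \left(K'(s^*, v) - K'(s, v)\right) \mathbb{I}\{s^*(X_i) \le Q(s^*, v)\}$

$\mathrm{III} = K'(s, v) \left(\mathbb{I}\{s^*(X_i) \le Q(s^*, v)\} - \mathbb{I}\{s(X_i) \le Q(s, v)\}\right)$

$\mathrm{IV} = K(s^*, v) - K(s, v)$

$\mathrm{V} = v \left(K'(s, v) - K'(s^*, v)\right)$.

By the Cauchy-Schwarz inequality, we only need to show that the expected value of the
square of each of these quantities is smaller than $(L(s) - L^*)^{\alpha}$ up to some multiplicative constant.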
Note that, by definition of $K$, we have:

$\mathrm{II} = \left(L'(s^*, v) - L'(s, v)\right) \mathbb{I}\{s^*(X_i) \le Q(s^*, v)\}$

$\mathrm{IV} = L(s^*) - L(s)$

$\mathrm{V} = v \left(L'(s, v) - L'(s^*, v)\right)$

where $L'(s, v)$ denotes the derivative of the function $v \mapsto L(s, v)$. It is clear that, for any
$s$, $L(s, v) = L(s^*, v)$ implies that $L'(s, v) = L'(s^*, v)$; otherwise $s^*$ would not be an
optimal scoring function at some level $v'$ in the vicinity of $v$. Therefore, since $\mathcal{S}$ is finite,
there exists a constant $c$ such that

$$\left(L'(s, v) - L'(s^*, v)\right)^2 \le c\,(L(s) - L^*)^{\alpha},$$

and then $\mathbb{E}(\mathrm{II}^2)$ and $\mathbb{E}(\mathrm{V}^2)$ are bounded accordingly.
Moreover, we have:

$$\mathbb{E}(\mathrm{I}^2) \le \mathbb{E}\left(\mathbb{I}_{C_s \Delta C^*_{u_0}}(X)\right) \le c\,(L(s) - L(s^*))^{\alpha}$$

for some positive constant $c$, by assumption (iii).

Finally, by assumption (ii), we have that $K'(s, v)$ is uniformly bounded and thus,
the term $\mathbb{E}(\mathrm{III}^2)$ can be handled similarly. □

Proof of Theorem 3. The proof is the same as that of Theorem 5 in [5], which
uses a result by Massart [2^. □

3 Performance measures for local ranking 

Our main interest here is to develop a setup describing the problem of not only finding but 
also ranking the best instances. As far as we know, this problem has not been considered 
from a statistical perspective until now. In the sequel, we build on the results from Section
2 and also on our previous work on the (global) ranking problem [5] in order to capture
some of the features of the local ranking problem. The present section is devoted to the
construction of performance measures reflecting the quality of ranking rules on a restricted
set of instances.
3.1 ROC curves and optimality in the local ranking problem

We consider the same statistical model as before with $(X, Y)$ being a pair of random
variables over $\mathcal{X} \times \{-1, +1\}$ and we examine ranking rules resulting from real-valued scoring
functions $s : \mathcal{X} \to (0, A)$. The reference tool for assessing the performance of a scoring
function $s$ in separating the two populations (positive vs. negative labels) is the Receiver
Operating Characteristic, known as the ROC curve ([36], [E]). If we take the notations
$\bar{G}_s(z) = \mathbb{P}\{s(X) > z \mid Y = 1\} = 1 - G_s(z)$ (true positive rate) and $\bar{H}_s(z) = \mathbb{P}\{s(X) > z \mid Y = -1\} = 1 - H_s(z)$
(false positive rate), we can define the ROC curve, for any scoring function $s$, as the plot
of the function:

$$z \mapsto \left(\bar{H}_s(z), \bar{G}_s(z)\right)$$

for thresholds $z \in (0, A)$, or equivalently as the plot of the function:

$$t \mapsto \bar{G}_s \circ H_s^{-1}(1 - t)$$

for $t \in (0, 1)$. The optimal scoring function is the one whose ROC curve dominates all
the others for all $z \in (0, A)$ (or $t \in (0, 1)$) and such a function actually exists. Indeed,
by recalling the hypothesis testing framework in the classification model (see Remark 1)
and using the Neyman-Pearson Lemma, it is easy to check that the ROC curve of the function
$\eta(x) = \mathbb{P}\{Y = 1 \mid X = x\}$ dominates the ROC curve of any other scoring function. We
point out that the ROC curve of a scoring function $s$ is invariant by strictly increasing
transformations of $s$.

In our approach, for a given scoring function $s$, we focus on thresholds $z$ corresponding
to the cut-off separating a proportion $u \in (0, 1)$ of top scored instances according to $s$ from
the rest. Recall from Section 2 that the best instances according to $s$ are the elements of
the set $C_{s,u} = \{x \in \mathcal{X} \mid s(x) > Q(s, 1 - u)\}$ where $Q(s, 1 - u)$ is the $(1 - u)$-quantile of
$s(X)$. We set the following notations:

$$\alpha(s, u) = \mathbb{P}\{s(X) > Q(s, 1 - u) \mid Y = -1\},$$
$$\beta(s, u) = \mathbb{P}\{s(X) > Q(s, 1 - u) \mid Y = +1\}.$$

We propose to re-parameterize the ROC curve with the proportion $u \in (0, 1)$ and then
describe it as the plot of the function:

$$u \mapsto \left(\alpha(s, u), \beta(s, u)\right),$$

for each scoring function $s$. When focusing on the best instances at rate $u_0$, we only
consider the part of the ROC curve for values $u \in (0, u_0)$.

However attractive the ROC curve is as a graphical tool, it is not a practical one for
developing learning procedures achieving straightforward optimization. The most natural
approach is to consider risk functionals built after the ROC curve such as the Area Under
an ROC Curve (known as the AUC or AROC, see [E]). Our goals in this section are:

1. to extend the AUC criterion in order to focus on restricted parts of the ROC curve,

2. to describe the optimal elements with respect to this extended criterion.

Figure 2: ROC curves, line $D(u_0, p)$ and truncated AUC at rate $u_0$ of best instances.

We point out the fact that extending the AUC is not trivial. Indeed, we notice that
$\alpha(s, u)$ and $\beta(s, u)$ are related by a linear relation, for fixed $u$ and $p$, when $s$ varies:

$$u = p\,\beta(s, u) + (1 - p)\,\alpha(s, u)$$

where $p = \mathbb{P}\{Y = 1\}$. We denote the line plot of this relation by $D(u, p)$. Hence, the part
of the ROC curve of a scoring function $s$ corresponding to the best instances at rate $u_0$ is
the part going from the origin $(0, 0)$ to the intersection between the line $D(u_0, p)$ and the
ROC curve (shaded area in the left display of Figure 2). It follows that the closer to $\eta$ the
scoring function $s$ is, the higher the ROC curve is, but at the same time the integration
domain shrinks (right display of Figure 2).
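The re-parameterized ROC curve and the linear relation between its two coordinates can be illustrated empirically. The sketch below is an assumption-laden illustration: it uses a hypothetical sample without ties, empirical quantiles in place of the true ones, and the empirical proportion of positive labels in place of $p$; the function names are ours.

```python
import math

def empirical_quantile(scores, v):
    """Generalized inverse of the empirical cdf: inf{t : F_n(t) >= v}."""
    if v <= 0:
        return -math.inf
    ordered = sorted(scores)
    n = len(ordered)
    k = max(1, min(n, math.ceil(n * v - 1e-12)))  # offset guards against round-off
    return ordered[k - 1]

def alpha_beta(scores, labels, u):
    """Empirical (alpha(s, u), beta(s, u)): rates of negatives and positives
    among the instances scored above the empirical (1 - u)-quantile."""
    q_hat = empirical_quantile(scores, 1 - u)
    top = [y for s, y in zip(scores, labels) if s > q_hat]
    return top.count(-1) / labels.count(-1), top.count(1) / labels.count(1)

# Hypothetical sample of scores s(X_i) and labels Y_i (no ties).
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6]
labels = [-1, -1, 1, 1, -1, -1, 1, 1]
p_hat = labels.count(1) / len(labels)

# One ROC point per rate u = k / n; each lies on the line D(u, p_hat).
rates = [k / len(scores) for k in range(1, len(scores) + 1)]
points = [alpha_beta(scores, labels, u) for u in rates]
for u, (a, b) in zip(rates, points):
    assert abs(p_hat * b + (1 - p_hat) * a - u) < 1e-12
print(points)
```

Restricting attention to the best instances at rate $u_0$ amounts to keeping only the points with $u \le u_0$ in this list.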

Our guideline in defining risk criteria for the problem of ranking the best instances is
the form of the optimal elements. We expect the optimal scoring functions at the rate
$u_0$ to belong to the equivalence class (functions defined up to the composition with a
nondecreasing transformation) defined by scoring functions $s^*$ such that:

$$s^*(x) = \eta(x) \ \text{ if } x \in C^*_{u_0}, \qquad s^*(x) < \inf_{z \in C^*_{u_0}} \eta(z) \ \text{ if } x \notin C^*_{u_0}.$$

Such scoring functions fulfill the two properties of finding the best instances (indeed
$C_{s^*, u_0} = C^*_{u_0}$) and ranking them as well as the regression function. We will denote by $\mathcal{S}^*$
the set of optimal scoring functions for the problem of ranking the best instances at the
rate $u_0$.
As a preliminary result, and before proposing an adequate criterion, we formulate a
simple lemma.

Lemma 8 For any scoring function $s$, we have for all $u \in (0, 1)$,

$$\beta(s, u) \le \beta(\eta, u),$$
$$\alpha(s, u) \ge \alpha(\eta, u).$$

Moreover, we have equality only for those $s$ such that $C_{s,u} = C^*_u$.

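PROOF. We show the first inequality. By definition, we have:

$$\beta(s, u) = 1 - G_s(Q(s, 1 - u)).$$

Observe that, for any scoring function $s$,

$$p\,(1 - G_s(Q(s, 1 - u))) = \mathbb{P}\{Y = 1, s(X) > Q(s, 1 - u)\} = \mathbb{E}\left(\eta(X)\,\mathbb{I}\{X \in C_{s,u}\}\right).$$

We thus have

$$p\,\left(G_s(Q(s, 1 - u)) - G_\eta(Q(\eta, 1 - u))\right)$$
$$= \mathbb{E}\left(\eta(X)\left(\mathbb{I}\{X \in C^*_u\} - \mathbb{I}\{X \in C_{s,u}\}\right)\right)$$
$$= \mathbb{E}\left(\eta(X)\,\mathbb{I}\{X \notin C^*_u\}\left(\mathbb{I}\{X \in C^*_u\} - \mathbb{I}\{X \in C_{s,u}\}\right)\right) + \mathbb{E}\left(\eta(X)\,\mathbb{I}\{X \in C^*_u\}\left(\mathbb{I}\{X \in C^*_u\} - \mathbb{I}\{X \in C_{s,u}\}\right)\right)$$
$$\ge -\mathbb{E}\left(Q(\eta, 1 - u)\,\mathbb{I}\{X \notin C^*_u\}\,\mathbb{I}\{X \in C_{s,u}\}\right) + \mathbb{E}\left(Q(\eta, 1 - u)\,\mathbb{I}\{X \in C^*_u\}\left(1 - \mathbb{I}\{X \in C_{s,u}\}\right)\right)$$
$$= Q(\eta, 1 - u)\,(1 - u - 1 + u) = 0.$$

The second inequality simply follows from the identity below:

$$1 - u = p\,G_s(Q(s, 1 - u)) + (1 - p)\,H_s(Q(s, 1 - u)).$$ □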

In view of this result, a wide collection of criteria with the set $\mathcal{S}^*$ as the set of optimal
elements could naturally be considered, depending on how one wants to weight the two
types of error $1 - \beta(s,u)$ (type II error in the hypothesis testing framework) and $\alpha(s,u)$
(type I error) according to the rate $u \in [0,u_0]$. However, not all the criteria obtained in
this manner can be interpreted as generalizations of the AUC criterion for $u_0 = 1$.
3.2 Generalization of the AUC criterion

In [5], we have considered the ranking error of a scoring function $s$ as defined by:

$$R(s) = \mathbb{P}\{(Y - Y')(s(X) - s(X')) < 0\},$$

where $(X',Y')$ is an i.i.d. copy of the random pair $(X,Y)$.

Interestingly, it can be proved that minimizing the ranking error $R(s)$ is equivalent
to maximizing the well-known AUC criterion. This is trivial once we write down the
probabilistic interpretation of the AUC:

$$\mathrm{AUC}(s) = \mathbb{P}\{s(X) > s(X') \mid Y = 1,\ Y' = -1\} = 1 - \frac{1}{2p(1-p)}\,R(s) .$$
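The relation between the AUC and the ranking error also holds exactly for their empirical versions when there are no ties, which gives a quick sanity check. The following sketch assumes simulated Gaussian scores and uses hypothetical helper names (`empirical_auc`, `empirical_ranking_error`); the ranking error is normalized by the $n^2$ ordered pairs so that it matches the empirical value of $2p(1-p)$.

```python
import random

def empirical_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked in the right order."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = sum(1 for a in pos for b in neg if a > b)
    return wins / (len(pos) * len(neg))

def empirical_ranking_error(scores, labels):
    """Fraction of ordered pairs (i, j) with (Y_i - Y_j)(s_i - s_j) < 0."""
    n = len(scores)
    bad = sum(1 for i in range(n) for j in range(n)
              if (labels[i] - labels[j]) * (scores[i] - scores[j]) < 0)
    return bad / n ** 2

random.seed(1)
n = 300
labels = [1 if random.random() < 0.4 else -1 for _ in range(n)]
scores = [random.gauss(0.8 if y == 1 else 0.0, 1.0) for y in labels]

p_hat = sum(1 for y in labels if y == 1) / n
auc = empirical_auc(scores, labels)
risk = empirical_ranking_error(scores, labels)
# Difference between AUC and 1 - R / (2 p (1 - p)); zero up to rounding.
gap = abs(auc - (1 - risk / (2 * p_hat * (1 - p_hat))))
print(auc, risk, gap)
```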

We now propose a local version of the ranking error on a measurable set $C \subset \mathcal{X}$:

$$R(s,C) = \mathbb{P}\{(s(X) - s(X'))(Y - Y') < 0,\ (X,X') \in C^2\} ,$$

and the local analogue of the AUC criterion:

$$\mathrm{LocAUC}(s,u) = \mathbb{P}\{s(X) > s(X'),\ s(X) > Q(s,1-u) \mid Y = 1,\ Y' = -1\} .$$

This criterion obviously boils down to the standard criterion for $u = 1$. However,
in the case where $u < 1$, we will see that there is no equivalence between maximizing
the LocAUC criterion and minimizing the local ranking error $s \mapsto R(s,u) = R(s,C_{s,u})$.
Indeed, the local ranking error is not a relevant performance measure for finding the best
instances. Minimizing it would solve the problem of finding the instances that are the
easiest to rank.

The following theorem states that scoring functions $s^*$ in the set $\mathcal{S}^*$ maximize the
criterion LocAUC and that the latter may be decomposed as a sum of a 'power' term
and (the opposite of) a local ranking error term.

Theorem 9 Let $u_0 \in (0,1)$. We have, for any scoring function $s$:

$$\forall s^* \in \mathcal{S}^*, \quad \mathrm{LocAUC}(s,u_0) \le \mathrm{LocAUC}(s^*,u_0) .$$

Moreover, the following relation holds:

$$\forall s, \quad \mathrm{LocAUC}(s,u_0) = \beta(s,u_0) - \frac{1}{2p(1-p)}\,R(s,u_0) ,$$

where $R(s,u_0) = R(s,C_{s,u_0})$.
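The decomposition of the theorem can be verified on a sample, where it holds exactly for the empirical counterparts in the absence of ties. This is a minimal sketch under assumptions of our own: simulated Gaussian scores, a threshold chosen so that exactly $k = u_0 n$ scores lie above it, and the hypothetical helper `local_report`.

```python
import random

def local_report(scores, labels, u0):
    """Empirical beta, LocAUC and local ranking error at rate u0."""
    n = len(scores)
    k = round(u0 * n)                      # size of the selected set, assumes 0 < k < n
    q = sorted(scores, reverse=True)[k]    # exactly k scores lie strictly above q (no ties)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    p_hat = len(pos) / n
    beta = sum(1 for a in pos if a > q) / len(pos)
    # Positive above the threshold and above the negative it is compared with.
    loc_auc = sum(1 for a in pos for b in neg
                  if a > b and a > q) / (len(pos) * len(neg))
    # Discordant ordered pairs whose two scores both exceed the threshold.
    loc_risk = sum(1 for i in range(n) for j in range(n)
                   if (labels[i] - labels[j]) * (scores[i] - scores[j]) < 0
                   and min(scores[i], scores[j]) > q) / n ** 2
    return beta, loc_auc, loc_risk, p_hat

random.seed(2)
n = 200
labels = [1 if random.random() < 0.4 else -1 for _ in range(n)]
scores = [random.gauss(1.0 if y == 1 else 0.0, 1.0) for y in labels]

beta, loc_auc, loc_risk, p_hat = local_report(scores, labels, 0.25)
rhs = beta - loc_risk / (2 * p_hat * (1 - p_hat))
print(loc_auc, rhs)
```

The 'power' term $\beta$ is an upper bound on the empirical LocAUC, and the gap is exactly the rescaled local ranking error.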
PROOF. Set $v_0 = 1 - u_0$. Observe first that:

$$\mathrm{LocAUC}(s,u_0) = \mathbb{E}(H_s(s(X))\,\mathbb{I}\{s(X) > Q(s,v_0)\} \mid Y = 1) = \int_{Q(s,v_0)}^{+\infty} H_s(z)\,G_s(dz) .$$

We use that $pG_s = F_s - (1-p)H_s$ and we obtain:

$$p\,\mathrm{LocAUC}(s,u_0) = \int_{Q(s,v_0)}^{+\infty} H_s(z)\,F_s(dz) - (1-p)\int_{Q(s,v_0)}^{+\infty} H_s(z)\,H_s(dz)$$
$$= \int_{v_0}^{1} (1 - \alpha(s,1-v))\,dv - \frac{1-p}{2}\left(1 - (1 - \alpha(s,u_0))^2\right) .$$

This formula, combined with Lemma 8, establishes the first part of Theorem 9.
Besides, integrating by parts and making a change of variables, we get:

$$\int_{Q(s,v_0)}^{+\infty} H_s(z)\,G_s(dz) = 1 - (1 - \alpha(s,u_0))(1 - \beta(s,u_0)) - \int_0^{\alpha(s,u_0)} (1 - \beta(s,\alpha))\,d\alpha$$
$$= \int_0^{\alpha(s,u_0)} \beta(s,\alpha)\,d\alpha + \beta(s,u_0)(1 - \alpha(s,u_0)) .$$

On the other hand, one has

$$\alpha(s,u_0)\,\beta(s,u_0) = \frac{1}{p(1-p)}\,\mathbb{P}\{s(X) \wedge s(X') > Q(s,v_0),\ Y' = 1,\ Y = -1\}$$
$$= \mathbb{P}\{s(X') > s(X),\ s(X) \wedge s(X') > Q(s,v_0) \mid Y' = 1,\ Y = -1\}$$
$$\quad + \frac{1}{p(1-p)}\,\mathbb{P}\{s(X') < s(X),\ (X,X') \in C^2_{s,u_0},\ Y' = 1,\ Y = -1\}$$
$$= \int_0^{\alpha(s,u_0)} \beta(s,\alpha)\,d\alpha + \frac{1}{2p(1-p)}\,R(s,u_0) .$$

Plugging this in the previous formula leads to the second statement of the theorem. $\square$
Remark 7 (Truncating the AUC) In the theorem, we obviously recover the relation
between the standard AUC criterion and the (global) ranking error when $u_0 = 1$. Besides,
by checking the proof, one may relate the generalized AUC criterion to the truncated
AUC. As a matter of fact, we have:

$$\forall s, \quad \mathrm{LocAUC}(s,u_0) = \int_0^{\alpha(s,u_0)} \beta(s,\alpha)\,d\alpha + \beta(s,u_0) - \alpha(s,u_0)\,\beta(s,u_0) .$$

The values $\alpha(s,u_0)$ and $\beta(s,u_0)$ are the coordinates of the intersecting point between
the ROC curve of the scoring function $s$ and the line $D(u_0,p)$. Thus, the integral term
represents the area of the surface delimited by the ROC curve, the horizontal axis
and the line $x = \alpha(s,u_0)$ (see Figure 2). The theorem reveals that evaluating the local
performance of a scoring statistic $s(X)$ by the truncated AUC as proposed in [9] is highly
arguable since the maximizer of the functional $s \mapsto \int_0^{\alpha(s,u_0)} \beta(s,\alpha)\,d\alpha$ is usually not in
$\mathcal{S}^*$.



3.3 Generalized Wilcoxon statistic 

We now propose a different extension of the plain AUC criterion. Consider $(X_1,Y_1), \ldots,$
$(X_n,Y_n)$, $n$ i.i.d. copies of the random pair $(X,Y)$. The intuition relies on a well-known
relationship between Mann-Whitney and Wilcoxon statistics. Indeed, a natural empirical
estimate of the AUC is the rate of concordant pairs:

$$\widehat{\mathrm{AUC}}(s) = \frac{1}{n_+ n_-} \sum_{1 \le i,j \le n} \mathbb{I}\{Y_i = -1,\ Y_j = 1,\ s(X_i) < s(X_j)\},$$

with $n_+ = n - n_- = \sum_{i=1}^n \mathbb{I}\{Y_i = +1\}$. On the other hand, we recall that the Wilcoxon
statistic $T_n(s)$ is the two-sample linear rank statistic associated to the score generating
function $\Phi(v) = v$, $\forall v \in (0,1)$, obtained by summing the ranks corresponding to positive
labels:

$$T_n(s) = \sum_{i=1}^n \mathbb{I}\{Y_i = 1\}\,\frac{\mathrm{rank}(s(X_i))}{n+1} ,$$

where $\mathrm{rank}(s(X_i))$ denotes the rank of $s(X_i)$ in the sample $\{s(X_j),\ 1 \le j \le n\}$. We refer
to [m [35] for basic results related to linear rank statistics. The following relation is
well-known:

$$\frac{n_+ n_-}{n+1}\,\widehat{\mathrm{AUC}}(s) + \frac{n_+(n_+ + 1)}{2(n+1)} = T_n(s) .$$

Moreover, the statistic $T_n(s)/n_+$ is an asymptotically normal estimate of

$$W(s) = \mathbb{E}(F_s(s(X)) \mid Y = 1) .$$
Note that the theoretical counterpart of the previous relation may be written as

$$W(s) = (1-p)\,\mathrm{AUC}(s) + p/2 .$$
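The finite-sample relation between the Wilcoxon statistic and the empirical AUC can be checked numerically. The sketch below is an illustration only: the data are simulated, ties are assumed absent, and the helper names `wilcoxon_statistic` and `empirical_auc` are hypothetical.

```python
import random

def wilcoxon_statistic(scores, labels):
    """Sum over positive instances of rank / (n + 1), ranks running from 1 to n."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    rank = {i: r + 1 for r, i in enumerate(order)}    # rank 1 = smallest score
    return sum(rank[i] for i in range(n) if labels[i] == 1) / (n + 1)

def empirical_auc(scores, labels):
    """Rate of concordant (negative, positive) pairs."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    return sum(1 for a in pos for b in neg if b < a) / (len(pos) * len(neg))

random.seed(4)
n = 250
labels = [1 if random.random() < 0.35 else -1 for _ in range(n)]
scores = [random.gauss(1.0 if y == 1 else 0.0, 1.0) for y in labels]

n_pos = sum(1 for y in labels if y == 1)
n_neg = n - n_pos
t_n = wilcoxon_statistic(scores, labels)
# Right-hand side: n+ n- / (n+1) * AUC + n+ (n+ + 1) / (2 (n+1)).
rhs = (n_pos * n_neg / (n + 1) * empirical_auc(scores, labels)
       + n_pos * (n_pos + 1) / (2 * (n + 1)))
print(t_n, rhs)
```

Dividing both sides by $n_+$ and letting $n$ grow gives back the population relation $W(s) = (1-p)\,\mathrm{AUC}(s) + p/2$.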

Now, in order to take into account a proportion $u_0$ of the highest ranks only, one may
consider the criterion related to the score generating function $\Phi_{u_0}(v) = v\,\mathbb{I}\{v > 1 - u_0\}$:

$$W(s,u_0) = \mathbb{E}(\Phi_{u_0}(F_s(s(X))) \mid Y = 1)$$

which we shall call the W-ranking error at rate $u_0$.

Note that its empirical counterpart is given by $T_n(s,u_0)/n_+$, with

$$T_n(s,u_0) = \sum_{i=1}^n \mathbb{I}\{Y_i = 1\}\,\Phi_{u_0}\!\left(\frac{\mathrm{rank}(s(X_i))}{n+1}\right) .$$
Using the results from the previous subsection, we can easily check that the following
theorem holds.

Theorem 10 We have, for all $s$:

$$\forall s^* \in \mathcal{S}^*, \quad W(s,u_0) \le W(s^*,u_0) .$$

Furthermore, we have:

$$W(s,u_0) = \frac{p}{2}\,\beta(s,u_0)(2 - \beta(s,u_0)) + (1-p)\,\mathrm{LocAUC}(s,u_0) .$$

PROOF. The result easily follows from the following representation of $W(s,u_0)$:

$$W(s,u_0) = \int_{Q(s,1-u_0)}^{+\infty} F_s(z)\,G_s(dz)$$

and from the fact that $F_s = pG_s + (1-p)H_s$. $\square$



Remark 8 (On the choice of a score generating function $\Phi$) The idea of weighting
the empirical AUC criterion with non-uniform weights is equivalent to considering
smooth score generating functions $\Phi$ instead of our $\Phi_{u_0}$ in the W-ranking error. Deriving
optimality results for smooth criteria with our method is straightforward but we point out
that, in this case, probabilistic interpretations are lost. In this approach, the stochastic
processes arising are rank processes for which there is no theory available at this moment.
Remark 9 (Evidence against 'divide-and-conquer' strategies) It is noteworthy
that not all combinations of $\beta(s,u_0)$ (or $\alpha(s,u_0)$) and $R(s,u_0)$ lead to a criterion with $\mathcal{S}^*$
being the set of optimal scoring functions. We have provided two non-trivial examples
for which this is the case (Theorems 9 and 10). But, in general, this remark should
discourage naive 'divide-and-conquer' strategies for solving the local ranking
problem. By naive 'divide-and-conquer' strategies, we refer here to stagewise strategies
which would, first, compute an estimate $\hat{C}$ of the set containing the best instances, and
then, solve the ranking problem over $\hat{C}$ as described in [5]. However, this idea combined
with a certain amount of iterativeness might be the key to the design of efficient algorithms.
In any case, we stress here the importance of making use of a global criterion, synthesizing
our double goal: finding and ranking the best instances.

4 Empirical risk minimization of the local AUC criterion

In the previous section, we have seen that there are various performance measures which
can be considered for the problem of ranking the best instances. In order to perform the
statistical analysis, we will favor the representations of LocAUC and W which involve the
classification error $L(s,u_0)$ and the local ranking error $R(s,u_0)$. By combining Theorems
9 and 10 we can easily get:

$$2p(1-p)\,\mathrm{LocAUC}(s,u_0) = (1-p)(p + u_0) - (1-p)\,L(s,u_0) - R(s,u_0)$$

and

$$2p\,W(s,u_0) = C(p,u_0) + \left(\frac{p + u_0}{2} - 1\right) L(s,u_0) - \frac{1}{4}\,L^2(s,u_0) - R(s,u_0)$$

where $C(p,u_0)$ is a constant depending only on $p$ and $u_0$.

We exploit the first expression and choose to study the minimization of the following
criterion for ranking the best instances:

$$M(s) = M(s,u_0) = R(s,C_{s,u_0}) + (1-p)\,L(s,u_0) .$$

It is obvious that the elements of $\mathcal{S}^*$ are the optimal elements of the functional $M(\cdot,u_0)$
and we will now consider scoring functions obtained through empirical risk minimization
of this criterion.

More precisely, given $n$ i.i.d. copies $(X_1,Y_1), \ldots, (X_n,Y_n)$ of $(X,Y)$, we introduce the
empirical counterpart:

$$\hat{M}_n(s) = \hat{M}_n(s,u_0) = \hat{R}_n(s) + \frac{n_-}{n}\,\hat{L}_n(s),$$

with $n_- = \sum_{i=1}^n \mathbb{I}\{Y_i = -1\}$ and

$$\hat{R}_n(s) = \frac{1}{n(n-1)} \sum_{i \ne j} \mathbb{I}\{(s(X_i) - s(X_j))(Y_i - Y_j) < 0,\ s(X_i) \wedge s(X_j) > \hat{Q}(s,1-u_0)\} .$$

Note that $\hat{R}_n(s)$ is expected to be close to the U-statistic of degree two

$$R_n(s) = \frac{1}{n(n-1)} \sum_{i \ne j} k_s((X_i,Y_i),(X_j,Y_j)) ,$$

with symmetric kernel

$$k_s((x,y),(x',y')) = \mathbb{I}\{(s(x) - s(x'))(y - y') < 0,\ s(x) \wedge s(x') > Q(s,1-u_0)\} .$$

The statistic $R_n(s)$ corresponds to an unbiased estimate of the local ranking error
$R(s,u_0)$. The next result provides a standard error bound for the excess risk of the
empirical risk minimizer over a class $\mathcal{S}$ of scoring functions:

$$\hat{s}_n = \arg\min_{s \in \mathcal{S}} \hat{M}_n(s) .$$
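To make the empirical criterion concrete, here is a minimal sketch of empirical risk minimization over a toy class of two scoring functions. Several ingredients are assumptions of ours rather than the paper's: the empirical quantile convention (the threshold leaves exactly $k = u_0 n$ scores above it), the form of the empirical classification error (the rule labels an instance positive when its score exceeds the threshold), the simulated one-dimensional data, and the names `empirical_criterion` and `candidates`.

```python
import random

def empirical_criterion(scores, labels, u0):
    """Empirical local ranking error plus (n_- / n) times the empirical
    classification error at rate u0."""
    n = len(scores)
    k = round(u0 * n)                        # assumes 0 < k < n and no ties
    q_hat = sorted(scores, reverse=True)[k]  # k scores lie strictly above q_hat
    n_neg = sum(1 for y in labels if y == -1)
    # Error of the rule "label +1 iff score > q_hat" (assumed form of L_n).
    l_hat = sum(1 for s, y in zip(scores, labels)
                if (y == 1) != (s > q_hat)) / n
    # Discordant pairs with both scores above the empirical threshold.
    r_hat = sum(1 for i in range(n) for j in range(n)
                if i != j
                and (labels[i] - labels[j]) * (scores[i] - scores[j]) < 0
                and min(scores[i], scores[j]) > q_hat) / (n * (n - 1))
    return r_hat + (n_neg / n) * l_hat

random.seed(3)
n = 200
labels = [1 if random.random() < 0.3 else -1 for _ in range(n)]
xs = [random.gauss(2.0 if y == 1 else 0.0, 1.0) for y in labels]

# Toy class S with two candidate scoring functions (hypothetical).
candidates = {"identity": lambda x: x, "reversed": lambda x: -x}
values = {name: empirical_criterion([f(x) for x in xs], labels, 0.3)
          for name, f in candidates.items()}
best = min(values, key=values.get)           # empirical risk minimizer over S
print(values, best)
```

Since positives have larger features in this simulation, the scoring function that preserves the order of $x$ should achieve the smaller criterion value.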

Proposition 11 Assume that conditions (i)-(ii) of Theorem 2 are fulfilled. Then,
there exist constants $c_1$ and $c_2$ such that, for any $\delta > 0$, we have:

$$M(\hat{s}_n) - \inf_{s \in \mathcal{S}} M(s) \le c_1 \sqrt{\frac{V}{n}} + c_2 \sqrt{\frac{\ln(1/\delta)}{n}}$$

with probability larger than $1 - \delta$.



PROOF (sketch). The proof combines the argument used in the proof of Theorem 2 with
the techniques used in establishing Proposition 2 in [1].

$$M(\hat{s}_n) - \inf_{s \in \mathcal{S}} M(s) \le 2 \sup_{s \in \mathcal{S}} |\hat{R}_n(s) - R_n(s)| + 2(1-p) \sup_{s \in \mathcal{S}} |\hat{L}_n(s) - L(s)| + 2 \sup_{s \in \mathcal{S}} |R(s) - R_n(s)| + 2 \left| \frac{n_+}{n} - p \right| .$$

The classification term $\sup_{s \in \mathcal{S}} |\hat{L}_n(s) - L(s)|$ may be bounded by applying the result
stated in Theorem 2 while the last one can be handled by using Bernstein's exponential
inequality for an average of Bernoulli random variables. By combining Lemma 1 in ^ with
the Chernoff method, we can deal with the U-process term $\sup_{s \in \mathcal{S}} |R(s) - R_n(s)|$. Finally,
the term $\sup_{s \in \mathcal{S}} |\hat{R}_n(s) - R_n(s)|$ can also be controlled by repeating the argument in the
proof of Theorem 2. The only difference here is that we have to consider the U-process term

$$\sup_{(s,t)} \left| \frac{1}{n(n-1)} \sum_{i \ne j} \left\{ K_{s,t}((X_i,Y_i),(X_j,Y_j)) - \mathbb{E}[K_{s,t}((X,Y),(X',Y'))] \right\} \right|$$

with

$$K_{s,t}((x,y),(x',y')) = \mathbb{I}\{(s(x) - s(x'))(y - y') < 0,\ s(x) \wedge s(x') > t\} .$$

For deriving first-order results with such a process, we refer to the same type of argument
as used in [4]. $\square$

Remark 10 (About the possibility of deriving fast rates) By checking the proof
sketch, it turns out that sharper bounds may be achieved for the U-process term. Indeed,
it is a simple variation of our previous work in [4] where we have used Hoeffding's
decomposition in order to grasp the deep structure of the underlying statistic. Here we
will need, in addition, condition (iii) to hold for all $u \in (0,u_0]$. Indeed, if we localize our
low-noise assumption from [4], it takes the following form: there exist constants $\alpha \in (0,1)$
and $B > 0$ such that, for all $t > 0$, we have

$$\forall x \in C^*_{u_0}: \quad \mathbb{P}\{|\eta(X) - \eta(x)| \le t\} \le B\,t^{\frac{\alpha}{1-\alpha}} .$$

It is easy to see that this is equivalent to condition (iii) for all $u \in (0,u_0]$: there exist
constants $\alpha \in (0,1)$ and $B > 0$ such that, for all $t > 0$, we have

$$\forall u \in (0,u_0]: \quad \mathbb{P}\{|\eta(X) - Q(\eta,1-u)| \le t\} \le B\,t^{\frac{\alpha}{1-\alpha}} .$$


However, in the present formulation where p is assumed to be unknown, it looks like this 
improvement will be spoiled by the 'proportion term' which will still be of the order of a 

Appendix - Proof of Proposition 6

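First, for all $(s,v) \in \mathcal{S} \times (0,1)$ set

$$V_n(s,v) = \frac{1}{n} \sum_{i=1}^n Y_i\,\mathbb{I}\{s(X_i) \le Q(s,v)\} - K(s,v) .$$

We have the following decomposition:

$$\forall v \in [0,1], \quad \hat{K}_n(s,v) - K(s,v) = V_n(s, F_s \circ \hat{F}_s^{-1}(v)) + K(s, F_s \circ \hat{F}_s^{-1}(v)) - K(s,v) .$$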






We shall first prove that

$$V_n(s, F_s \circ \hat{F}_s^{-1}(v_0)) = V_n(s,v_0) + O_{\mathbb{P}}(n^{-1}) .$$

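We denote by $A(s,\varepsilon)$ the event $\{|F_s \circ \hat{F}_s^{-1}(v_0) - v_0| < \varepsilon\}$. On the event $A(s,\varepsilon)$, we have:

$$|V_n(s, F_s \circ \hat{F}_s^{-1}(v_0)) - V_n(s,v_0)| \le \sup_{v:\,|v - v_0| \le \varepsilon} |V_n(s,v) - V_n(s,v_0)| .$$

We bound the right hand side for fixed $\varepsilon$, by making use of an argument from [M]. First,
we need to put things into the right format. Set:

$$V_n(s,v) - V_n(s,v_0) = \frac{1}{n} \sum_{i=1}^n (U_i(s,v) - U_i(s,v_0)) ,$$

where $U_i(s,v) = Y_i\,\mathbb{I}\{s(X_i) \le Q(s,v)\} - \mathbb{E}(Y\,\mathbb{I}\{s(X) \le Q(s,v)\})$ for $s \in \mathcal{S}$ and $v \in (0,1)$.
We observe that

$$|U_i(s,v) - U_i(s,v_0)| \le d_i(v,v_0),$$

where

$$d_i(v,v_0) = \mathbb{I}\{s(X_i) \in [Q(s,v \wedge v_0), Q(s,v \vee v_0)]\} + |v - v_0| .$$

Denote by

$$\hat{d}_n(v,v_0) = \frac{1}{n} \sum_{i=1}^n \mathbb{I}\{s(X_i) \in [Q(s,v \wedge v_0), Q(s,v \vee v_0)]\} + |v - v_0|$$

a distance over $\mathbb{R}$. Set also:

$$\hat{R}(\varepsilon) = \sup_{v:\,|v - v_0| \le \varepsilon} \hat{d}_n(v,v_0)$$

and observe that

$$\hat{R}(\varepsilon) = \frac{1}{n} \sum_{i=1}^n \mathbb{I}\{s(X_i) \in [Q(s,v_0 - \varepsilon), Q(s,v_0 + \varepsilon)]\} + \varepsilon .$$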



We then have, by applying Lemma 8.5 from for nt^/6^(e) sufficiently large. 



sup |Vn(s,v) - Vn(s,Vo)| > t 

V : |v— vo|<e 



Xi , . . . , Xn > < C exp 



J cnt^ 



for some positive constants $c$ and $C$. It remains to integrate out and, for this purpose, we
introduce the event:

$$\forall x > 0, \quad A(x) = \{3\varepsilon - x \le \hat{R}(\varepsilon) \le 3\varepsilon + x\} .$$

We then have:

Now, we have, by Bernstein's inequality:

f|a(x)| =2F|;iB(n,2e) -2e >x| <2exp|-^j^^| 

where we have used the notation $B(n,2\varepsilon)$ for a binomial $(n,2\varepsilon)$ random variable. We can
take $x = O(t/\sqrt{\varepsilon})$ and assume also $x = o(\varepsilon)$ to get, for $nt^2/\varepsilon^2$ large enough,

I f rn+2" 

sup |Vn(s,v) - Vn(s,vo)| > t > < Cexp 



V : |v— vol<e 



cnt 



for some positive constants $c$ and $C$. This can be reformulated, by writing that the
following bound holds, with probability larger than $1 - \delta/2$,

$$\sup_{v:\,|v - v_0| \le \varepsilon} |V_n(s,v) - V_n(s,v_0)| \le \varepsilon \sqrt{\frac{\log(2C/\delta)}{nc}} .$$

We recall that, by the triangle inequality and the Dvoretzky-Kiefer-Wolfowitz theorem, if we
take $\varepsilon = c\sqrt{\log(1/\delta)/n}$, we have $\mathbb{P}\{A(s,\varepsilon)\} \ge 1 - \delta/2$. It follows that, with probability larger
than $1 - \delta$, we have, for some constant $\kappa$:

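for any $s \in \mathcal{S}$. Now it remains to deal with the second term $K(s, F_s \circ \hat{F}_s^{-1}(v_0)) - K(s,v_0)$.
By the differentiability assumption, we have: $\forall s \in \mathcal{S}$,

$$\sup_{|v - v_0| \le \delta} \{K(s,v) - K(s,v_0) - (v - v_0)K'(s,v_0)\} = O(\delta^2) , \text{ as } \delta \to 0 .$$

Since $|F_s \circ \hat{F}_s^{-1}(v_0) - v_0| = O_{\mathbb{P}}(n^{-1/2})$, we get that

$$K(s, F_s \circ \hat{F}_s^{-1}(v_0)) - K(s,v_0) = K'(s,v_0)(F_s \circ \hat{F}_s^{-1}(v_0) - v_0) + O_{\mathbb{P}}(n^{-1}) , \text{ as } n \to \infty .$$

Moreover, as

$$F_s \circ \hat{F}_s^{-1}(v_0) - v_0 = -(\hat{F}_s \circ F_s^{-1}(v_0) - v_0) + O_{\mathbb{P}}(n^{-1}) ,$$

we finally obtain that

$$K(s, F_s \circ \hat{F}_s^{-1}(v_0)) - K(s,v_0) = -K'(s,v_0)(\hat{F}_s \circ F_s^{-1}(v_0) - v_0) + O_{\mathbb{P}}(n^{-1}) . \qquad \square$$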